Introduction

The purpose of the project is to provide an easy, consumable form of skill rating for professional Call of Duty players. This could be used as a descriptive statistic, but it can also be used to guess the winner of the largest tournament of the year, Call of Duty Champs.


What is the Call of Duty World League?

Call of Duty is a first-person shooter that first began in 2003. Since then, it has become one of the largest multiplayer video game franchises to exist. During this time, a competitive scene for the game has gained traction. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world to play in. In these events, these pros play three different game modes to decide the winner of a series. These game modes are Hardpoint, Search and Destroy, and then a third game mode that often changes yearly. For the data that we are covering, the third game mode is Control. All of the teams in the league consist of 5 players, and the series are Best of 5’s.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")

Game Modes in Circuit

Hardpoint

In Hardpoint, the two teams must fight over a point on the map where every second they spend in this point, they gain one point. This point is called the “hardpoint.” If two teams are in the hardpoint at the same time, then neither teams collects points. Every sixty seconds, the hardpoint changes locations on the map, so teams must make tactical decisions to be able to rotate across the map. The first team to 250 points wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(6*60 + 35)

Search and Destroy

In Search and Destroy, the two teams play rounds where each player only has one life; if you die, you are dead until the next round. The objective is to either kill the entire other team before the time limit, or if you are on offense, then you can plant the bomb. If the bomb detonates after 45 seconds, then you also win the round. The first team to win 6 rounds wins the map.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(18*60 + 39)

Control

In Control, there is an offense team and a defense team. There are multiple rounds where each team switches off between offense and defense. Each team has 30 lives per round. The first time to win three rounds wins the map. The offensive team is trying to either capture two points on the map, or eliminate all 30 lives of the other team. The defensive team is trying to either defend the two points before the time rounds out, or eliminate all 30 lives of the other team.

embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
  use_start_time(45*60 + 40)

Load Packages


Load Data

This project makes use of official CWL data that is uploaded on Github. All data is organized relatively cleanly and all missing data is reported.

proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))

# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)

# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))

What data are we concerned with?

In order to assign an overall score to each individual player, we will need to address Hardpoint, Search and Destroy, and the Control separately. Once we have an individual score for each of the three game modes, we can use these to determine a final score.

Hardpoint: 1. player – what player does the data correspond to
2. mode – game mode
3. win – ‘W’ or ‘L’; use to find overall player win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better team work
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done in the map
8. player_spm – score per minute
9. hill_time_s – hill time measured in seconds
10. hill_captures – shows activity on the map (MIGHT INCLUDE)
11. hill_defends – shows activity on the map (MIGHT INCLUDE)
12. match_id – helpful for getting rid of missing data

Search and Destroy: 1. player – what player does the data correspond to
2. mode – game mode
3. win – ‘W’ or ‘L’; use to find overall player win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better team work
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done in the map
8. player_spm – score per minute
9. fb_round_ratio – ‘snd_firstbloods’/‘snd_rounds’ (NOT INCLUDED IN BASE DATA SET)
10. bomb_sneak_defuses – sneak defuses are often in pivotal rounds
11. bomb_plants – good indicator of role (MIGHT INCLUDE)
12. bomb_defuses – good indicator of role (MIGHT INCLUDE)
13. match_id – helpful for getting rid of missing data

Control: 1. player – what player does the data correspond to
2. mode – game mode
3. win – ‘W’ or ‘L’; use to find overall player win/loss ratio
4. k_d – kill/death ratio; used to show overall impact on the map
5. assists – in addition to k/d, assists show overall support on the map; higher assists can indicate better team work
6. accuracy_percent – player accuracy for each match
7. damage_dealt – total damage done in the map
8. player_spm – score per minute
9. match_id – helpful for getting rid of missing data


Data Split


Data Cleaning and Organization

The data below is for all of the majors throughout the season, except for COD Champs. We will reserve COD Champs to act as a test set. The raw data from each major is merged into one major dataset, further broken up into Hardpoint, SND, and Control datasets.

All Majors 2019 data

# CLEANING
majors2019 <- majors2019 %>% clean_names(.)

# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')

# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]

# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')

# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% remove_cols(n)

# includes all existing data for all players that have played more than 50 games (arbitrary number)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN player_numgames')

# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')

# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
  as.factor()

# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
  group_by(player) %>%
  count(player, fave_weapon) %>%
  top_n(1, n) %>%
  mutate(role = fave_weapon) %>%
  subset(select = -c(fave_weapon, n))

# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")

# making factors
playerRoles$role <- factor(playerRoles$role)

# manually adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]

majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")

A player’s role is defined as a sub (1), flex (2), or an ar (3).

Hardpoint subset

# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]

Search and Destroy subset

# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods FROM majors2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)

# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]

Control subset

# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]

Champs 2019 dataset

champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')

# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
  as.factor()

champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")

Hardpoint CHAMPS subset

# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends FROM champs2019 WHERE mode == "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$player),]

Search and Destroy CHAMPS subset

# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods FROM champs2019 WHERE mode == "Search & Destroy"')

# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)

# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
  group_by(player) %>%
  mutate(fb_avg = mean(snd_firstbloods))

# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$player),]

Control CHAMPS subset

# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$player),]

Team Hardpoint Data

# getting all necessary data for hardpoint
mergedhp2019 <- sqldf('SELECT match_id, team, player, role, kills, deaths, win, assists, damage_dealt, player_spm, hill_captures, hill_defends FROM majors2019 WHERE mode == "Hardpoint"')

# organizing by each match
mergedhp2019 <- mergedhp2019[order(mergedhp2019$match_id),]

# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(mergedhp2019, match_id) %>% subset(., n == 10) %>% remove_cols(n)

# includes matches where all 10 players have existing data
mergedhp2019 <- sqldf('SELECT * FROM mergedhp2019 WHERE match_id IN match_numplayers')


# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
test_mergedhp2019 <- mergedhp2019 %>%
  rename(damage = damage_dealt,
         spm = player_spm,
         hillcaptures = hill_captures,
         hilldefends = hill_defends) %>%
   mutate(rn = rowid(match_id, team)) %>% 
   pivot_wider(names_from = rn, values_from = c(win, 
                                                player, 
                                                kills,
                                                deaths,
                                                assists, 
                                                damage, 
                                                spm, 
                                                hillcaptures, 
                                                hilldefends)) %>%
  subset(select = -c(win_2, win_3, win_4, win_5,
                     player_1, player_2, player_3, player_4, player_5)) %>%
  rename(win = win_1)

# team_mergedhp2019 <- test_mergedhp2019 %>%
#   group_by(match_id, team) %>%
#   mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
#          deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
#          kd = kills/deaths,
#          assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
#          spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
#          hillcaptures = sum(hillcaptures_1, hillcaptures_2, hillcaptures_3, hillcaptures_4, hillcaptures_5),
#          hilldefends = sum(hilldefends_1, hilldefends_2, hilldefends_3, hilldefends_4, hilldefends_5),
#          damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5)) %>%
#   subset(select = c(win, kd, assists, spm, hillcaptures, hilldefends, damage))

Team SND Data

# getting all necessary data for hardpoint
team_snd2019 <- sqldf('SELECT match_id, k_d, role, team, player, win, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_firstbloods, snd_rounds FROM snd2019')

# organizing by each match
team_snd2019 <- team_snd2019[order(team_snd2019$match_id),]

# removing all matches that DON'T include all 10 players
# calculates all the matches that have all 10 players
match_numplayers <- count(team_snd2019, match_id) %>% subset(., n == 10) %>% remove_cols(n)

# includes matches where all 10 players have existing data
team_snd2019 <- sqldf('SELECT * FROM team_snd2019 WHERE match_id IN match_numplayers')


# merge rows so that all the players from each team are one row; expect 800 observations with about 50 variables
team_snd2019 <- team_snd2019 %>%
  rename(kd = k_d,
         damage = damage_dealt,
         spm = player_spm,
         fb = snd_firstbloods, 
         rounds = snd_rounds,
         defuses = bomb_defuses,
         plants = bomb_plants,
         nd = bomb_sneak_defuses) %>%
   mutate(rn = rowid(match_id, team)) %>% 
   pivot_wider(names_from = rn, values_from = c(win, 
                                                player,
                                                kd,
                                                role,
                                                assists, 
                                                damage, 
                                                spm, 
                                                fb, 
                                                rounds,
                                                defuses,
                                                plants,
                                                nd)) %>%
  subset(select = -c(win_2, win_3, win_4, win_5,
                     player_1, player_2, player_3, player_4, player_5,
                     rounds_2, rounds_3, rounds_4, rounds_5,
                     match_id, team)) %>%
  rename(win = win_1) %>%
  rename(rounds = rounds_1)



# team_snd2019 <- team_snd2019 %>%
#   group_by(match_id, team) %>%
#   mutate(kills = sum(kills_1, kills_2, kills_3, kills_4, kills_5),
#          deaths = sum(deaths_1, deaths_2, deaths_3, deaths_4, deaths_5),
#          kd = kills/deaths,
#          assists = sum(assists_1, assists_2, assists_3, assists_4, assists_5),
#          spm = mean(spm_1, spm_2, spm_3, spm_4, spm_5),
#          damage = sum(damage_1, damage_2, damage_3, damage_4, damage_5),
#          fb = sum(fb_1, fb_2, fb_3, fb_4, fb_5),
#          fbratio = fb/rounds_1,
#          plants = sum(plants_1, plants_2, plants_3, plants_4, plants_5),
#          defuses = sum(defuses_1, defuses_2, defuses_3, defuses_4, defuses_5),
#          nd = sum(nd_1, nd_2, nd_3, nd_4, nd_5)) %>%
#   subset(select = c(win, role, kd, kills, deaths, assists, rounds_1, damage, fbratio, plants, defuses, nd))

Exploratory Data Analysis

For my exploratory data analysis, I will be using just the season data. It will not include the Champs data.

Kill/death for season

ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "OVERALL Player K/D's, 2019 Season (BO4), Descending")

ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")

ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", subtitle = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")

Search and Destroy First Bloods

Search and Destroy is a gamemode that has multiple rounds, where in each round, every player only has one life. A “first blood” is the first kill of the round and is usually highly influential. This a common stat that commentators and the community look at.

Firstblood average

# player firstblood average for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", subtitle = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Firstblood totals

# player firstbloods for SND 2019

ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", subtitle = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Firstblood/round

# player firstblood/round for SND 2019

ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", subtitle = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")

Overall Damage Dealt

# player damage dealt OVERALL 2019

# removes all entries where damage is 0; this is almost always a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')

ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", subtitle = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")

Overall Score/Min (spm)

# Overall score per minute for 2019 season

ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", subtitle = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")

Number of Wins

# Overall number of wins for 2019 season

playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player

ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity') + coord_flip() + labs(y = "Number of Wins", x = "Player", subtitle = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")

The top 4 players with the most amount of wins in the season are Slasher, Octane, Kenny, and Enable. The interesting part about this is that all of these players were on the same team, 100 Thieves. They all tied with 116 wins during the season.

playerwins %>%
  ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "white")

The number of wins appears to follow a normal distribution. The left side of the histogram appears to be slightly more populated, but I hypothesize that this is due to players that didn’t play for the whole season.


(Bad code) Correlation between variables

I will begin by working with hardpoint data. In order to do this, I will need to compile each players’ statistics into one row.

I will be working with a new dataset, just so I don’t mess up the original dataset.

hp2019_mb <- hp2019

Number of wins for each player.

hp_playerwins <- sqldf('SELECT player, win FROM hp2019_mb WHERE win == "W"') # selects all the wins for each player
hp_playerwins <- hp_playerwins %>% count(player) # counts wins per players

Attempting to use totals to predict number of wins of each player.

hp2019_totals <- hp2019_mb %>%
  group_by(player) %>%
  mutate(tot_kills = sum(kills)) %>%
  mutate(tot_deaths = sum(deaths)) %>%
  mutate(tot_x = sum(x)) %>%
  mutate(tot_spm = sum(player_spm)) %>%
  mutate(tot_assists = sum(assists)) %>%
  mutate(tot_damagedealt = sum(damage_dealt)) %>%
  mutate(tot_hilltime = sum(hill_time_s)) %>%
  mutate(tot_captures = sum(hill_captures)) %>%
  mutate(tot_defends = sum(hill_defends)) %>%
  subset(., select = -c(k_d, kills, deaths, x, assists, damage_dealt, hill_time_s, hill_captures, hill_defends, player_spm, win)) %>%
  unique()

# creates a final dataset for hardpoint where each player only has one row, which includes the averages of all of his statistics
hp2019_totals <- merge(hp_playerwins, hp2019_totals, by = c("player"))

Correlation plots and matrix for TOTALS

# # Equivalent with a formula
# pairs(n ~ tot_kills + tot_deaths + tot_spm + tot_assists + tot_damagedealt + tot_hilltime + tot_captures + tot_defends, data = hp2019_totals)
# 
# corr_hp2019_totals <- hp2019_totals %>%
#   subset(., select = -c(player))
# 
# cor(corr_hp2019_totals)

Relationship for TOTALS

# plot(hp2019_totals$n, hp2019_totals$tot_x, 
#      xlab = "Number of Wins", 
#      ylab = "Total Kills - Total Deaths",
#      title("Relationship between Number of Wins and Kill and Death Discrepancy")) +
#   xlim(0, 50) +
#   ylim(-200, 300)

Regression model for Hardpoint for TOTALS

# lm_totals <- lm(n ~ tot_kills + tot_deaths + tot_spm + tot_assists + tot_damagedealt + tot_hilltime + tot_captures + tot_defends, data = hp2019_totals)
# 
# summary(lm_totals)

Models: Player HP

I will be trying to predict whether an individual player will win or lose a game based on his statistics in the given game.

Splitting Data:

hp2019_wl <- hp2019

set.seed(3068)

hp2019_wlsplit <- hp2019_wl %>%
  initial_split(prop = 0.8, strata = "win")

hp2019_train <- training(hp2019_wlsplit)
hp2019_test <- testing(hp2019_wlsplit)


head(hp2019_train)
##    player  k_d role win kills deaths   x assists damage_dealt player_spm
## 4   Abezy 0.66    1   0    19     29 -10       6         3891      290.9
## 5   Abezy 1.18    1   0    26     22   4       8         4480      393.3
## 12  Abezy 0.88    1   0    22     25  -3      14         4515      322.3
## 15  Abezy 0.76    1   0    19     25  -6       8         4868      295.7
## 16  Abezy 0.83    1   0    20     24  -4       4         3954      269.5
## 20  Abezy 1.33    1   0    28     21   7       7         4733      400.3
##    hill_time_s hill_captures hill_defends
## 4           48             4            5
## 5           55             5           14
## 12          80             9            6
## 15          77             7           11
## 16          35             3            6
## 20          71             6           10
dim(hp2019_train)
## [1] 3551   13
dim(hp2019_test)
## [1] 889  13
prop.table(table(hp2019_train$win))
## 
##         0         1 
## 0.4987328 0.5012672

Model 1: Decision Tree

Internet Based

Creating a general decision tree specification using rpart:

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("classification")

class_tree_fit <- class_tree_spec %>%
  fit(win ~ k_d + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends, data = hp2019_train)

class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Checking confusion matrix and accuracy of the train data:

augment(class_tree_fit, new_data = hp2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1379  601
##          1  392 1179
augment(class_tree_fit, new_data = hp2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.720

Lab 7 Based

class_tree_wf <- workflow() %>%
  add_model(class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_formula(win ~ .)
set.seed(3068)
hp2019_fold <- vfold_cv(hp2019_train)

param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

tune_res <- tune_grid(
  class_tree_wf, 
  resamples = hp2019_fold, 
  grid = param_grid, 
  metrics = metric_set(accuracy)
)
autoplot(tune_res)

best_complexity <- select_best(tune_res)

class_tree_final <- finalize_workflow(class_tree_wf, best_complexity)

class_tree_final_fit <- fit(class_tree_final, data = hp2019_train)
class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.


Model 2: Random Forest

# library(randomForest)
# library(datasets)
# library(caret)

set.seed(306)

rf <- randomForest(win ~ k_d + role + assists + damage_dealt + player_spm + hill_time_s + hill_captures + hill_defends, data = hp2019_train, ntree = 500)

print(rf)
## 
## Call:
##  randomForest(formula = win ~ k_d + role + assists + damage_dealt +      player_spm + hill_time_s + hill_captures + hill_defends,      data = hp2019_train, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 26.87%
## Confusion matrix:
##      0    1 class.error
## 0 1291  480   0.2710333
## 1  474 1306   0.2662921
rfAccuracy <- (1291 + 1306) / (1291 + 1306 + 480 + 474)
rfAccuracy
## [1] 0.7313433
importance(rf)
##               MeanDecreaseGini
## k_d                  399.78105
## role                  57.25065
## assists              153.92702
## damage_dealt         245.64875
## player_spm           311.43135
## hill_time_s          313.32427
## hill_captures        115.84950
## hill_defends         172.24218
varImpPlot(rf)

pred1=predict(rf,type = "prob")

library(ROCR)

perf = prediction(pred1[,2], hp2019_train$win)

# 1. Area under curve
auc = performance(perf, "auc")
auc
## A performance instance
##   'Area under the ROC curve'
# 2. True Positive and Negative Rate
pred3 = performance(perf, "tpr","fpr")

# 3. Plot the ROC curve
plot(pred3,main="ROC Curve for Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")

plot(rf)


Models: Player SND

Splitting Data:

set.seed(1)

snd2019_split <- snd2019 %>%
  initial_split(prop = 0.8, strata = "win")

snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)


head(snd2019_train)
## # A tibble: 6 × 18
## # Groups:   player [1]
##   match_id      team  player role  win   kills deaths   k_d assists damage_dealt
##   <chr>         <chr> <chr>  <fct> <fct> <dbl>  <dbl> <dbl>   <dbl>        <dbl>
## 1 737440468739… eUni… Abezy  1     0        11      7  1.57       0         1355
## 2 144010197940… eUni… Abezy  1     0         7      7  1          1         1938
## 3 259560665349… eUni… Abezy  1     0         2      7  0.29       4          796
## 4 146217688303… eUni… Abezy  1     0         7      7  1          0         1275
## 5 177954718572… eUni… Abezy  1     0         9      8  1.12       1         1561
## 6 167778343948… eUni… Abezy  1     0        10      9  1.11       3         1434
## # … with 8 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## #   bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## #   snd_firstbloods <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>
dim(snd2019_train)
## [1] 2791   18
dim(snd2019_test)
## [1] 699  18
prop.table(table(snd2019_train$win))
## 
##        0        1 
## 0.498746 0.501254

Model 1: Decision Tree

Creating a general decision tree specification using rpart:

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("classification")

class_tree_fit <- class_tree_spec %>%
  fit(win ~ k_d + role + assists + damage_dealt + 
        player_spm + bomb_sneak_defuses + bomb_plants +
        bomb_defuses + fb_round_ratio + snd_firstbloods, 
      data = snd2019_train)

class_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

Checking confusion matrix and accuracy of the train data:

augment(class_tree_fit, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1099  544
##          1  293  855
class_tree <- augment(class_tree_fit, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
class_tree
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.700

Model 2: Logistic Regression

Recipe, Engine, and Workflow:

snd_recipe <- recipe(win ~ k_d + role + assists + damage_dealt + player_spm + bomb_plants + snd_firstbloods + bomb_defuses + bomb_sneak_defuses, data = snd2019_train)

log_reg <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

log_wkflow <- workflow() %>% 
  add_model(log_reg) %>% 
  add_recipe(snd_recipe)

log_fit <- fit(log_wkflow, snd2019_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
log_fit %>% 
  tidy()
## # A tibble: 11 × 5
##    term               estimate std.error statistic  p.value
##    <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)        -1.33     0.162       -8.24  1.74e-16
##  2 k_d                 3.76     0.219       17.2   2.47e-66
##  3 role2              -0.0230   0.102       -0.224 8.23e- 1
##  4 role3               0.145    0.126        1.16  2.47e- 1
##  5 assists             0.476    0.0348      13.7   1.74e-42
##  6 damage_dealt       -0.00117  0.000128    -9.16  5.22e-20
##  7 player_spm         -0.0273   0.00340     -8.03  1.01e-15
##  8 bomb_plants         0.298    0.0525       5.67  1.39e- 8
##  9 snd_firstbloods     0.0838   0.0496       1.69  9.13e- 2
## 10 bomb_defuses        0.225    0.121        1.85  6.36e- 2
## 11 bomb_sneak_defuses  0.168    0.303        0.553 5.80e- 1

Assessing Model Performance:

predict(log_fit, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
##    .pred_0 .pred_1
##      <dbl>   <dbl>
##  1   0.321   0.679
##  2   0.735   0.265
##  3   0.689   0.311
##  4   0.587   0.413
##  5   0.535   0.465
##  6   0.431   0.569
##  7   0.576   0.424
##  8   0.766   0.234
##  9   0.478   0.522
## 10   0.450   0.550
## # … with 2,781 more rows
augment(log_fit, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction    0    1
##          0 1099  489
##          1  293  910
augment(log_fit, new_data = snd2019_train) %>%
  conf_mat(truth = win, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Checking accuracy:

log_reg_acc <- augment(log_fit, new_data = snd2019_train) %>%
  accuracy(truth = win, estimate = .pred_class)
log_reg_acc
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.720

Model 3: Support Vector Machine

svm_rbf_spec <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")

svm_rbf_fit <- svm_rbf_spec %>%
  fit(win ~ k_d + role + assists + damage_dealt + player_spm + bomb_plants + snd_firstbloods + bomb_defuses + bomb_sneak_defuses, data = snd2019_train)

augment(svm_rbf_fit, new_data = snd2019_test) %>%
  conf_mat(truth = win, estimate = .pred_class)
##           Truth
## Prediction   0   1
##          0 261 117
##          1  88 233
svm <- augment(svm_rbf_fit, new_data = snd2019_test) %>%
  accuracy(truth = win, estimate = .pred_class)

Model 4: Random Forest

# library(randomForest)
# library(datasets)
# library(caret)

set.seed(306)

rf <- randomForest(win ~ k_d + role + assists + damage_dealt + player_spm + bomb_plants + snd_firstbloods + bomb_defuses + bomb_sneak_defuses, data = snd2019_train, ntree = 500)

print(rf)
## 
## Call:
##  randomForest(formula = win ~ k_d + role + assists + damage_dealt +      player_spm + bomb_plants + snd_firstbloods + bomb_defuses +      bomb_sneak_defuses, data = snd2019_train, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 29.95%
## Confusion matrix:
##      0   1 class.error
## 0 1039 353   0.2535920
## 1  483 916   0.3452466
rfAccuracy <- (1039 + 916) / (1039 + 940 + 353 + 483)
rfAccuracy
## [1] 0.6944938
importance(rf)
##                    MeanDecreaseGini
## k_d                      370.890973
## role                      61.347862
## assists                  122.964469
## damage_dealt             303.306436
## player_spm               319.524345
## bomb_plants               63.491990
## snd_firstbloods           70.803857
## bomb_defuses              29.122384
## bomb_sneak_defuses         7.346533
varImpPlot(rf)

pred1=predict(rf,type = "prob")

library(ROCR)

perf = prediction(pred1[,2], snd2019_train$win)

# 1. Area under curve
auc = performance(perf, "auc")
auc
## A performance instance
##   'Area under the ROC curve'
# 2. True Positive and Negative Rate
pred3 = performance(perf, "tpr","fpr")

# 3. Plot the ROC curve
plot(pred3,main="ROC Curve for Random Forest",col=2,lwd=2)
abline(a=0,b=1,lwd=2,lty=2,col="gray")


Comparing Model Performance

accuracies <- c(class_tree$.estimate, log_reg_acc$.estimate, svm$.estimate, rfAccuracy)
accuracies
## [1] 0.7001075 0.7198137 0.7067239 0.6944938